Abstract

This paper develops a data reference modeling technique to 
estimate with high accuracy the cache miss ratio in cache- 
coherent multiprocessors. The technique involves analyz- 
ing the dynamic data referencing behavior of parallel al- 
gorithms. Data reference modeling first identifies different 
types of shared data blocks accessed during the execution 
of a parallel algorithm, then captures in a few parameters 
the cache behavior of each shared block as a function of 
the problem size, number of processors, and cache line size, 
and finally constructs an analytical expression for each algo- 
rithm to estimate the cache miss ratio. Because the number 
of processors, problem size, and cache line size are included 
as parameters, the expression for the cache miss ratio can 
be used to predict the performance of systems with differ- 
ent configurations. Six parallel algorithms are studied, and 
the analytical results compared against previously published 
simulation results, to establish the confidence level of the 
data reference modeling technique. It is found that the aver- 
age prediction error for four out of six algorithms is within 
five percent and within ten percent for the other two. The 
paper also derives from the model several results on how 
cache miss rates scale with system size. 



1 Introduction 

An early phase in the design of multiprocessor systems is the 
definition of target applications to be run on the system along 
with potential hardware configurations. Of the various pos- 
sible hardware configurations, the configuration that yields 
the best performance for a given cost is typically selected for
the final system design specification. Because the cache is 
a critical determinant of multiprocessor performance, a sim- 
ple analytical model of cache behavior that can rapidly yield 
cache miss rates for various parallel algorithms as a function 
of system and problem parameters, such as the number of 
processors and problem size, is extremely desirable during 
the early definition stage of the design process. 

This paper develops a data reference modeling method- 
ology to analyze parallel algorithms and obtain information 
that can be used to estimate the cache miss ratio in multi- 
processor systems. Because the model captures the problem 
size, the system size, and the cache line size as parameters, it 
can be used to predict cache miss ratios for different system 
configurations. However, because the method is based on 
an analysis of parallel algorithms, it is not suitable for the 
analysis of complex or irregular applications, where the data 
reference patterns are hard to discern. 

The data reference modeling technique consists of the 
following steps: 

1. Identifying different types of shared data blocks ac- 
cessed during the execution of a parallel algorithm 
by each processor. This technique is most convenient
when the types of shared blocks accessed by each
processor are the same for all processors, a behavior
commonly exhibited by data parallel applications (i.e.,
applications in which each processor executes the same
code but operates on different data sets).

2. Capturing in a few parameters the cache behavior of 
each shared block as a function of the problem size, 
number of processors, and cache line size. We assume 
that the partitioning strategy is an algorithm-specific 
property. The parameters essentially characterize a type
of processor locality inherent in each type of sharing.
Processor locality was described in [2] as the tendency 
of a processor to repeatedly access a given block of data 
before an access by a remote processor. A similar form 
of locality was also measured by Eggers [6] using the 
notion of write runs, and by Dubois and Wang [5] using
the notion of an access burst.

The three parameters used to capture the processor
locality are the number of accesses, a, of a specific type
of shared block by a given processor; the number of
remote writes, w, to that block that result in cache
misses suffered by the given processor; and the number
of first-time fetches, f, of that block of data that
are not preceded by a remote write. The parameter f
contributes to the startup miss cost, and only affects
the cache miss ratio in the first iteration of typical
iterative algorithms. Therefore, f may be ignored when
the algorithm executes many iterations. Notice that the
ratio a/w yields a measure of the average number of
uninterrupted accesses (i.e., a series of local accesses
without any remote write) to a block of data following
an eviction from a given processor's cache due to a
remote write.

3. Constructing an analytical expression for each algo- 
rithm to estimate the cache miss ratio. Because the 
processor locality parameters w and a are expressed as 
a function of the number of processors, problem size, 
and cache line size, the expression for the cache miss 
ratio can be used to predict the performance of systems 
with different configurations. 
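
The way the three parameters combine can be sketched in a few lines. This is a minimal illustration, assuming that the miss ratio of one block type is the number of misses (f + w) divided by the number of accesses a; the function names are hypothetical and not taken from the paper.

```python
def block_miss_ratio(a, w, f=0):
    """Miss ratio of one shared-block type: (f + w) / a.

    a: number of accesses by the given processor
    w: number of remote writes that cause misses on this processor
    f: number of first-time fetches (may be left at 0 when the
       algorithm runs for many iterations)
    """
    if a <= 0:
        raise ValueError("a must be positive")
    return (f + w) / a


def uninterrupted_run(a, w):
    """Average number of local accesses between remote writes (a / w)."""
    if w == 0:
        return float("inf")  # the block is never invalidated remotely
    return a / w
```

For instance, a block type with a = 100, w = 4, and f = 1 has a miss ratio of 0.05 and an average uninterrupted run of 25 accesses.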

This paper validates the model using six parallel algo- 
rithms. We find that the average prediction error for four 
out of six algorithms is within five percent and within ten 
percent for the other two. The paper also derives from the 
model several results on how cache miss rates scale with 
block size, number of processors and problem size. 

This paper first discusses related work in Section 2. Sec- 
tion 3 develops the data reference modeling methodology to 
estimate a multiprocessor's cache miss ratio, and validates 
it using previously published simulation data for six parallel 
applications in Section 4. Results for two out of the six ap- 
plications are discussed in detail. Section 5 uses the model 
to study the effects of problem size, cache line size, and 
number of processors on the cache miss ratio. The problem 
size is the size of the shared data structures indicated in the 
parallel algorithm. Section 6 concludes the paper. 



2 Previous Work 

Several previous studies directly relate to our current re- 
search: the independent reference model of Dubois and 
Briggs [4], the processor locality based model of Agar- 
wal [1], the access burst model of Dubois and Wang [5], 
the write-run model of Eggers [6], and the directory model 
of Simoni and Horowitz [9]. 

To our knowledge, the model by Dubois and Briggs [4] 
was the earliest effort on analytically obtaining cache miss 



rates due to invalidation. They estimated miss ratios from a
Markov model assuming that every shared block was equally
likely to be accessed by any processor in the system. Because 
references to shared memory typically display temporal lo- 
cality in much the same manner as private references do, the 
predicted miss rates turn out to be very pessimistic [1]. 

More recently, researchers have begun using some mea- 
sure of processor locality in their models. The locality based 
model of Agarwal used a measure of the processor locality 
derived from an address trace. Processor locality is derived 
from a measurement of the interval between references to a 
given block of shared data by a given processor and the in- 
terval between write references to that block of data by other 
processors. Cache miss rates for systems with other con- 
figurations are derived using a simple Markov model. The 
drawback with this approach is that while the cache miss 
rate predictions are accurate for the system size from which 
measurements were made, attempts to extrapolate cache be- 
havior for other system sizes are unsuccessful. The reason 
for this difficulty lies in our inability to predict data reference 
patterns arising from algorithmic behavior from a single ad- 
dress trace. An examination of algorithmic behavior, on the 
other hand, directly reveals this information. 

The access burst model is based upon the observation that 
global shared writable blocks are accessed largely in critical 
sections. Within a critical section, a block can be assumed 
to be accessed by a single processor without interruption 
from other processors. A burst is defined to be a duration 
of accesses by a processor to a global shared block in an 
uninterrupted manner. The access burst size is a measure of 
the processor locality of the shared block. The model also 
measures the probability that a block is modified during an 
access burst. The measurements are made for each block 
size and the problem size is fixed. The model assumes that 
after a burst, all processors sharing the block are equally 
likely to access it again, and that the access bursts are in- 
dependent from one another. Using these assumptions and 
measurements, a Markov model representing the global state 
of a shared block is constructed to estimate the likelihood of 
occurrence of each type of cache event such as an invalida- 
tion. The states represent the number of processors sharing 
the block. 

Simoni and Horowitz focused on modeling the perfor- 
mance of limited pointer directory schemes. Their analysis 
models processor locality by assuming that a primary pro- 
cessor is more likely to access a block of data than one of 
several secondary processors. The analysis assumes that the 
number of processors actively accessing a block of data is 
fixed. They chose 64 for this number. 

Our approach is different from that of Simoni and 
Horowitz, and Dubois and Wang, in that we analyze the 
algorithms and derive expressions for a few parameters, as 
a function of problem size, number of processors and block 



[Diagram: nodes 1, ..., i, ..., P; the cache of node i holds shared block types $b_{i1}$, $b_{i2}$, ..., $b_{in_i}$.]

Figure 1 : Typical distribution of shared blocks in the caches 
in a multiprocessor system. 

3.2 Notation 

On a P-node shared-memory multiprocessor system, shared
data blocks are distributed among the caches in the nodes.
Figure 1 shows a distribution of various types of shared
blocks in the system. Let there be $n_i$ different types of
shared blocks in the cache in node $i$. Blocks with identical
access patterns are said to have the same type. Miss rates
of blocks with the same type can be captured using a single
DRM, that is, using a single function of system and problem
parameters. Let $b_{ij}$ denote the $j$-th type of shared block on
node $i$. The types of shared blocks on node $i$ are named $b_{i1}$,
$b_{i2}$, ..., $b_{in_i}$. A DRM is used to represent the state of each
$b_{ij}$.

We first introduce the notation for parameters derived from
an algorithm. The number of remote writes on $b_{ij}$ that induce
cache misses is denoted $w_{ij}$, and the number of first
references of $b_{ij}$ by processor $i$ that are not preceded by a
remote-write (i.e., $w_{ij}$) event is denoted $f_{ij}$. Let the number
of accesses of $b_{ij}$ be captured by $a_{ij}$. Finally, we will use
the variable $s_{ij}$ to denote the probability of accessing $b_{ij}$ on
node $i$; note that $s_{ij}$ is simply $a_{ij} / \sum_j a_{ij}$.

The following are the computed quantities. Let the miss
ratio of $b_{ij}$ be called $m_{ij}$. The miss ratio of a type of shared
block is simply the fraction of references to that type of
block that result in a miss. Misses are caused both by
remote writes followed by local accesses and by first-time
fetches.

Table 1 summarizes the notation used in this paper. 

Symbol      Meaning
$P$         number of processors
$N$         problem size
$B$         cache block size
$i$         index of the node count, $i = 1, 2, ..., P$
$j$         index of the types of shared blocks on node $i$
$n_i$       number of types of shared blocks on node $i$
$s_{ij}$    probability of accessing $b_{ij}$ on node $i$
$f_{ij}$    number of first-time fetches of $b_{ij}$
$a_{ij}$    number of accesses of $b_{ij}$
$w_{ij}$    number of remote writes that cause cache misses on $b_{ij}$
$m_{ij}$    miss ratio of $b_{ij}$
$M_i$       cache miss ratio of node $i$
$M$         cache miss ratio of system

Table 1: Notation



$$m_{ij} = \frac{f_{ij} + w_{ij}}{a_{ij}}$$



The parameters $a_{ij}$, $w_{ij}$, and $f_{ij}$ are algorithm-dependent
functions of $P$, $N$, and the block size $B$. A processor's miss
ratio ($M_i$) can be derived from the sum of each $b_{ij}$'s cache
miss ratio times the probability of accessing $b_{ij}$. That is,

$$M_i = \sum_{j=1}^{n_i} (s_{ij} \times m_{ij})$$

To calculate the average cache miss ratio $M$ of all processors
in the system, we use $M_i$ and the processor utilization
($U_i$) of each node $i$. The average processor cache miss ratio
is then:

$$M = \frac{\sum_{i=1}^{P} (M_i \times U_i)}{\sum_{i=1}^{P} U_i}$$



If all processors have the same utilization $U$, we can write

$$M = \frac{1}{P} \sum_{i=1}^{P} M_i$$
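
The two aggregation steps above can be sketched as follows. This is an illustrative reading of the formulas, not the authors' code: the access probability of each block type is taken as its share of the node's accesses, and the system miss ratio is the utilization-weighted mean of the per-node ratios. All function and argument names are hypothetical.

```python
def node_miss_ratio(accesses, block_miss_ratios):
    """M_i = sum_j s_ij * m_ij, with s_ij = a_ij / sum_j a_ij."""
    total = sum(accesses)
    return sum((a / total) * m for a, m in zip(accesses, block_miss_ratios))


def system_miss_ratio(node_ratios, utilizations=None):
    """M = sum_i (M_i * U_i) / sum_i U_i.

    With utilizations omitted, all nodes are assumed equally utilized
    and the result reduces to the plain average of the M_i.
    """
    if utilizations is None:
        return sum(node_ratios) / len(node_ratios)
    weighted = sum(m * u for m, u in zip(node_ratios, utilizations))
    return weighted / sum(utilizations)
```

A node with two block types accessed 30 and 10 times, with miss ratios 0.1 and 0.3, has $M_i = 0.75 \times 0.1 + 0.25 \times 0.3 = 0.15$.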



3.3 Processor Cache Miss Ratio 

A processor's cache miss ratio is derived from all DRMs on
that processor. After all the types of DRMs are identified
on the processor, we can determine $n_i$, $b_{ij}$, and $w_{ij}$ from
the data partition of an algorithm and the access sequence
of shared data during the execution. Then, the miss ratio
for each $b_{ij}$ is the ratio of the sum of the number of
first-time misses and the write-invalidate induced misses to the
number of accesses of the block.



In our experimental analysis, we will assume that all pro- 
cessors have the same utilization, so that we can make the 
above simplification. 



4 Applications and Validation 

In this paper, we demonstrate the accuracy and relative ease 
of using the data reference modeling approach by compar- 
ing results from simulations and modeling. We will use 



the applications studied by Dubois and Wang [5, 7] both 
because of the availability of simulation results on these ap- 
plications and due to the simplicity of the implementations 
of the parallel algorithms. While Dubois and Wang used 
nine algorithms, we studied six due to the lack of detailed 
published information on the others. The six algorithms we
study are: Jacobi iteration, successive over relaxation, dynamic
parallel quicksort, non-shuffling FFT, shuffling FFT,
and image component labeling.

The above six algorithms can be further divided into two 
categories: 

• Data independent algorithms: Jacobi, S.O.R., Shuf- 
fling FFT, and Non-shuffling FFT. Data independent 
algorithms are those in which the input data set does 
not affect the shared-data access patterns. 

• Data dependent algorithms: Dynamic quicksort and 
Image component labeling. Data dependent algorithms 
are those in which the input data sets may affect the 
access patterns to shared blocks. 

This paper presents a detailed analysis of one algorithm 
from each category: dynamic parallel quicksort and shuffling 
FFT. For details on the others see [8]. We hope to use 
these two types of algorithms to demonstrate different DRM 
analysis approaches. 

The examples show that the number of types of shared
blocks for many algorithms, especially for those that are
iterative, is very small, so the process of constructing
the DRMs for each type of shared block is not onerous.
Parallel iterative algorithms in which the iterations are dis- 
tributed among the processors not only allow extracting the 
DRMs for shared data types on one processor, but also allow 
us to focus on one iteration of the algorithm. 

By analyzing the application code, we identify all data 
blocks, both shared and private, accessed within each itera- 
tion. The number of references can also be determined from 
the application code for each iteration. Once the DRMs 
for an algorithm are identified, the cache miss ratio can be 
formulated as a function of the cache line size, the problem 
size, and the system size. 

After demonstrating that the model has acceptable ac- 
curacy for two algorithms, we will focus on analyzing the 
effects of cache line size, problem size and system size on 
multiprocessor cache miss rates. Our validation experiments 
compare the miss ratios obtained through analysis with the 
simulation results reported by Dubois and Wang [5, 7]. 

4.1 Shuffling FFT Algorithm 

The shuffling FFT algorithm evaluates the discrete Fourier
transform. Let $s(k)$, $k = 0, 1, 2, ..., N-1$ be $N$ samples of
a time function. The discrete Fourier transform of $s(k)$ is
defined to be the discrete function $x(j)$, $j = 0, 1, 2, ..., N-1$,
where

$$x(j) = \sum_{k=0}^{N-1} s(k) e^{-2\pi i j k / N}$$

and $i = \sqrt{-1}$.

The problem size of the shuffling FFT algorithm is the
$N$-item array $s(k)$; this array is usually divided into $P$ equal
sized chunks, where $P$ is the number of processors in the
system. The implementation uses a single copy of an $N$-item
array, called the valid array; the implementation also
uses two copies of the data array for temporaries. A
one-dimensional shuffling FFT algorithm for $N$ data items is
represented by a butterfly graph with $\log_2 N$ stages. These
$N$ data items are stored sequentially in the global memory.
Each processor is responsible for computing the FFT on
its chunk, which contains $N/P$ data items. In this algorithm,
computations of partial FFTs alternate with shuffling phases
where data are passed among processors. At most two
processors can share a block at any time. Coherence activity is
significantly reduced since data locality exists.

Figure 2 presents the shuffling FFT algorithm for $N = 16$
and $P = 4$. During the computation phase, each local
processor owns $N/P$ data elements and computes in a butterfly
fashion. Write operations always occur within the butterfly
computation phase and never occur in the shuffling phase,
and each shared block can be written by one processor only.
Each block is shared by at most two processors when the
precondition $B \ll N/P$ on the cache line size holds. If the cache
line size is too large, then data elements stored in a block may
be accessed by more than two processors; this may cause
more write invalidations than when the precondition
holds.

The algorithm requires a total of $2 \times \lceil \log_2 N / \log_2 (N/P) \rceil$ iterations;
each iteration consists of a computation stage and a shuffling
stage. Figure 2 shows data partitioning, synchronization,
butterfly computation data flow, and shuffling directions.
Each shared block is generally shared by only two processors,
except the first $N/(2P)$ data elements on the first processor
and the last $N/(2P)$ data elements on the last processor. The first
and last processors are different because they
have one neighboring processor, unlike the others, which have
two neighboring processors.

At the beginning of execution, all processors load their 
N/P data items into their respective local caches. During 
the computation phase of the algorithm, each data element 
has to be read and updated locally. Updated data elements 
are stored into the valid array, and synchronization takes 
place. 

During the shuffling phase, the data elements in the valid
array are copied into a local temporary array, then restored
into relative locations of the valid array. The relative
location to which a data element must shuffle depends
on its processor's location. All data elements are virtually
divided into two halves: the first half of the data elements is
shuffled towards the other half, and vice versa.
Synchronization is denoted by the dotted line in the figure, and is
required before each computation and shuffling phase.

Figure 2: Shuffling FFT algorithm for P=4 and N=16.

All reads and writes happen locally and no cache misses
occur during the computation phase. In the shuffling phase,
the order of shuffling operations is remote read, local write,
local read, and remote write. Cache misses only occur during
the remote read of the shuffling phase. This indicates there
is at most one cache miss per cache block, and $w_{ij} = 1$.
The only exception is the $N/(2P)$ data elements mentioned above
on the first and the last processor, for which $w_{ij} = 0$.

4.1.1 Cache Miss Ratio Validation for Shuffling FFT 

In order to calculate the cache miss ratio for the shuffling
FFT algorithm, one has only to focus on a single iteration of
both the computation and the shuffling stages. The behavior
of each iteration is independent throughout the whole course
of execution, which comprises $\lceil \log_2 N / \log_2 (N/P) \rceil$ stages.

We will first compute the miss ratio of the non-boundary
processors, and then adjust the miss rate to account for the
first and last processors' access patterns. Within the
computation phase, $\log_2 (N/P)$ butterfly stages are needed to
compute $N/P$ data elements. Each processor performs only two
reads and one write on each data element in each butterfly
stage of the computation phase, and two reads and two writes
in the shuffling phase. Therefore, the total number of accesses
on $b_{ij}$, namely $a_{ij}$, during each stage is
$(3 \cdot \log_2 (N/P) + 4) \cdot B$.

Cache misses happen only in the shuffling phase, where
$w_{ij} = 1$. Therefore, for the non-boundary processors,

$$n_i = 1, \quad \forall i = 2, 3, ..., P-1$$

$$w_{ij} = 1$$

and

$$s_{ij} = 1$$

Therefore,

$$m_{ij} = \frac{w_{ij}}{a_{ij}} = \frac{1}{B \cdot (3 \cdot \log_2 (N/P) + 4)}$$

And the cache miss ratio for each processor is

$$M_i = \frac{1}{B \cdot (3 \cdot \log_2 (N/P) + 4)} \qquad (1)$$



The effect of the first and the last processor can be included
as follows. Recall that processor 1 and processor $P$ have two
distinct types of blocks: those that have $w_{ij} = 1$ and those
that have $w_{ij} = 0$. Each processor has $N/(2P)$ blocks of each
type. Therefore, the miss ratio of the first processor and the
last processor is:

$$M_{1,P} = \frac{1}{2B \cdot (3 \cdot \log_2 (N/P) + 4)}$$
The average miss ratio for the whole system is given by

$$M = \frac{P-2}{P} M_i + \frac{2}{P} M_{1,P}$$

Simplifying, we get

$$M = \frac{P-1}{P B \cdot (3 \cdot \log_2 (N/P) + 4)}$$
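
The closed form for the shuffling FFT can be checked numerically with a short sketch. It assumes the reading used here: P - 2 non-boundary processors with miss ratio 1/(B(3 log2(N/P) + 4)), and two boundary processors with half that ratio. The function name and argument order are hypothetical.

```python
import math


def fft_miss_ratio(P, B, N):
    """System miss ratio of the shuffling FFT model.

    P: number of processors, B: cache line size in words,
    N: problem size (number of data items).
    """
    per_element = 3 * math.log2(N / P) + 4    # accesses per data element
    m_inner = 1 / (B * per_element)           # non-boundary processors
    m_edge = 1 / (2 * B * per_element)        # first and last processor
    # Average over P - 2 inner processors and 2 boundary processors;
    # algebraically this equals (P - 1) / (P * B * per_element).
    return ((P - 2) * m_inner + 2 * m_edge) / P
```

For example, `fft_miss_ratio(4, 4, 65536)` evaluates the model for four processors, a four-word line, and the problem size used in the validation; doubling B halves the predicted miss ratio.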



Table 2 compares the cache miss ratios derived from this
function and those obtained from simulation results
published in [5, 7]. The cache miss ratios are collected based
on cache line sizes of one, two, four, eight, and 16 words,
number of processors varying from two through eight, and a
problem size $N = 65536$. The comparisons are also shown
in Figure 3. A constant percentage difference is found
between simulated results and modeled results for each number
of processors. The observed mean difference is 4.3 percent.

Table 2: DRM versus simulation for shuffling FFT

Figure 3: DRM versus simulation for shuffling FFT.

4.2 Parallel Dynamic Quicksort

Dynamic quicksort is a divide-and-conquer algorithm
that sorts an array A[1], A[2], ..., A[N] by rearranging
it so that the condition A[1],...,A[j-1] < A[j] <
A[j+1],...,A[N] holds for some j, followed by a splitting process
that places the two subarrays into a global job queue. By
recursively applying the same procedure to the subarrays
A[1],...,A[j-1] and A[j+1],...,A[N], the entire array is sorted.
In its parallel implementation, at the end of each splitting
phase, the larger subarray is processed by the same processor
and a descriptor of the smaller subarray is stored into a global
job queue for potential distribution to other processors. The
larger subarray is retained to maximize data locality. When
a processor is idle it checks the global job queue and picks up
a subarray's descriptor if the global job queue is not empty.
If the size of the subarray is one, then the processor writes
the subarray back to the shared array and changes its state to idle.
The algorithm is completed when all processors are idle and
the job queue is empty.

The problem size of the dynamic quicksort algorithm 
is the N item array A. In this algorithm, every data item 
in this array may be accessed by all processors during the 
course of execution. The implementation of this algorithm 
uses a single copy of the array. There is no system wide 
synchronization required for this algorithm during run time. 
The only restriction on data sharing is that shared data are 
accessed mutually exclusively while a processor splits the 
array or the subarray. 

Figure 4 shows how shared data elements are accessed 
during the execution of the algorithm, for N = 32 and 
P = 4. Each square represents a data element in the array. 
Squares grouped by a rectangle in the graph show the data
elements that are allocated and sorted by a processor within that
stage. The dotted line shows data locality after the initial
loading. For simplicity, we assume that the subarrays are of 
equal size. 

During run time, the first processor (say, P1) allocates all
data elements initially, and releases less than half of those
elements at the end of stage 1. The second processor (say,
P2) then picks a subarray from the job queue and sorts the
subarray, and other processors follow the same procedure. After
stage 1, there are $(N - 1)/2$ unsorted elements on each
processor on average, and the $N$ element array has been
divided into 2 subarrays. After stage $x - 1$ (that is, on
entering stage $x$), there are $(N - 2^{x-1} + 1)/2^{x-1}$
unsorted elements on each processor, and
the array has been divided into $2^{x-1}$ subarrays. When the
number of unsorted elements on each processor reaches one,
the execution is completed. The average run time of the
algorithm is $\log_2 N$ stages ($\lceil \log_2 (N - 1) \rceil$ iterations).

4.2.1 Cache Miss Ratio and Model Validation 

Cache misses in the quicksort algorithm have a high degree
of data dependency. That is, it is impossible to figure out
exactly the number of cache misses without complete
knowledge of the data being sorted. However, we can proceed to
formulate a DRM for this type of algorithm by making
suitable assumptions about the data distribution. For simplicity,
we make the assumption that the array at each stage is evenly
divided into subarrays, since we do not know the exact size
of the two sorted subarrays. The assumption of even
division represents the worst-case scenario; a 65-35 split is more
likely if the numbers in the array are generated randomly.
Under this assumption, we can derive DRM parameters and
then estimate the cache miss ratio for the algorithm.

Figure 4: Shared data handling in dynamic quicksort, P = 4 and N = 32

At stage $x$, the $N$ element array has $2^{x-1} - 1$ sorted
data elements, an average queue size of $2^{x-1} - P$, and
$2^{x-1}$ subarrays. The average size of an unsorted subarray
is therefore $(N - 2^{x-1} + 1)/2^{x-1}$ elements. To derive the cache miss
ratio for a processor in this algorithm, since only one type
of shared block exists, it is convenient to consider the whole
course of the execution and derive expressions for $f$, $a$, and
$w$ for this type of shared block. Clearly, the value of $n_i$ is
one for all $i$, and the value of $s$ is also one, since only one
type of shared block exists.
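
The per-stage quantities under the even-split assumption can be tabulated with a small sketch. It assumes the reading that, on entering stage x, there are 2^(x-1) subarrays and 2^(x-1) - 1 already-sorted elements, so the remaining elements are shared evenly among the subarrays. The function name is hypothetical.

```python
def stage_profile(N, x):
    """Return (subarrays, sorted_elements, avg_unsorted_size) at stage x.

    Assumes every split divides a subarray evenly, as in the text.
    """
    subarrays = 2 ** (x - 1)
    sorted_elements = subarrays - 1          # one pivot fixed per split so far
    avg_unsorted = (N - sorted_elements) / subarrays
    return subarrays, sorted_elements, avg_unsorted
```

With N = 32, stage 1 starts with one subarray of 32 elements, and stage 2 with two subarrays averaging 15.5 elements each, which matches the (N - 1)/2 figure quoted for the end of stage 1.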

Here, we can ignore the work queue's data blocks, since
they have the same access pattern as the shared data blocks,
and hence will not change the cache miss ratio. However,
we need to take the work queue into account if we are
estimating the cache miss count.

There are three types of cache misses that occur during 
the execution of this algorithm: 

• initial loading misses 

• loading misses from the job queue, which correspond 
to cache misses that occur when a subarray is stored in 
the unsorted job queue and requested by a processor 

• invalidation and false-sharing misses within a cache 
block on account of accesses by different processors. 

The sum of the first two comprises the $f$ component of
misses, and the third type constitutes the $w$ component.

Given a cache line size $B$, the number of initial loading
misses is simply $\lceil N/B \rceil$.

The loading misses from the job queue are computed as
follows. After $\log_2 P$ stages, all processors in the system
become active and the job queue remains empty, because
an idle processor grabs a sorted subarray from the job queue
as soon as a sorted subarray is inserted by an active processor.
At stage $x$, there are a total of $2^{x-1} - P$ subarrays in the
queue, with an average size of $(N - 2^{x-1} + 1)/(B \cdot 2^{x-1})$
blocks. The probability that a subarray is inserted and taken
again from the queue by the same processor is $1/P$. Consequently,
the probability that reloading a subarray misses in the cache is $(P - 1)/P$.
Overall, from stage $\log_2 P + 2$ to stage $\log_2 N$, the number
of such reloading misses is:

$$\sum_{x = \log_2 P + 2}^{\log_2 N} \frac{P-1}{P} \cdot \frac{N - 2^{x-1} + 1}{B \cdot 2^{x-1}} \cdot (2^{x-1} - P)$$

The sum of the above two components is $f$.
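
A rough sketch of the f component follows. It assumes that the reloading term is the product of the three factors named in the text (the miss probability (P - 1)/P, the number of queued subarrays 2^(x-1) - P, and the average subarray size in blocks), summed from stage log2 P + 2 to stage log2 N, and that P and N are powers of two. The function name is hypothetical.

```python
import math


def first_time_misses(N, P, B):
    """Estimate f = initial loading misses + job-queue reloading misses."""
    initial = math.ceil(N / B)                 # one miss per block of the array
    reload = 0.0
    first_stage = int(math.log2(P)) + 2
    last_stage = int(math.log2(N))
    for x in range(first_stage, last_stage + 1):
        k = 2 ** (x - 1)                       # subarrays at stage x
        queued = k - P                         # subarrays waiting in the queue
        blocks = (N - k + 1) / (B * k)         # average subarray size in blocks
        reload += (P - 1) / P * queued * blocks
    return initial + reload
```

For N = 32, P = 4, and B = 1, only stages 4 and 5 contribute to the reloading term.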

As the algorithm completes, if the size of a subarray is
smaller than the cache line size, then write operations to a
subarray may cause write invalidations on other subarrays
within the same cache block. These misses caused by false
sharing within a cache block usually occur when the size of
subarrays is very small. They occur from stage
$\log_2 N - (\log_2 B + 1)$ to stage $\log_2 N$, and they
comprise the $w$ component of misses.

An approximate total cache access count in stage $x$ is
$11 \cdot (N + 2^{x-1} - 1)/2^{x-1}$, which involves 7 shared data accesses
and 4 local variable accesses for each data element in the
subarray. The constant 11 may vary depending on
the implementation of the algorithm. The sum of these values
over all the stages, namely
$\sum_{x=1}^{\log_2 N} [11 \cdot (N + 2^{x-1} - 1)/2^{x-1}]$, is the
value of $a$.

Since there is only one type of shared block in the system,
$m_{ij} = M_i = M$. Thus $M$ is simply $(f + w)$ divided by
the total cache access count $a$, that is, $M = (f + w)/a$.
Substituting for $f$, $w$, and $a$, we obtain the miss rate as
depicted in Table 3.

Table 3: Expression for the miss rate in quicksort.

Table 4: Simulation versus DRM result listing for quicksort

Table 4 compares the cache miss ratios derived from this 
function and the miss rates from simulations. Simulations 
use cache line sizes of one, two, four, eight, and 16 words, 
processor numbers of two, four, and eight, and a problem 
size of 32678. The comparisons are also shown in Figure 5. 
An average difference of 8.5 percent was found between 
simulated results and modeled results. We believe the dif- 
ference is primarily due to the unpredictability of the input 
data set, as discussed in Section 4.2.1. Nevertheless, we find 
that the analysis is still reasonable for this type of algorithm 
with suitable assumptions about data distributions. 

4.3 Summary of Results for Six Algorithms 

Table 5 lists the differences between the analytically obtained
miss ratios and the miss ratios obtained through simulations
for the six algorithms studied. We observe that the data
reference modeling approach is fairly accurate in estimating
the cache miss ratios for all studied cases, and is generally
significantly more accurate than the Access Burst Model [5]
for the same set of benchmarks. For example, the data
reference modeling approach yields average and maximum
errors of 0.63 and 1.30 percent respectively for Jacobi, while
the Access Burst Model yields average and maximum errors
of 10.03 and 17.82 percent respectively.

Figure 5: DRM versus simulation for quicksort.

The average error is the absolute difference divided by the
simulation value, abs(model-sim)/sim, averaged over the
configurations. The maximum error is the largest such difference.
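
The error metric can be written out as a short sketch. It assumes that both the average and the maximum are taken over the relative differences abs(model - sim)/sim of the individual configurations; the function names and the sample numbers in the usage note are illustrative only and are not data from the paper.

```python
def relative_errors(model, sim):
    """Per-configuration relative error abs(model - sim) / sim."""
    return [abs(m - s) / s for m, s in zip(model, sim)]


def summarize_errors(model, sim):
    """Return (average, maximum) of the relative errors."""
    errors = relative_errors(model, sim)
    return sum(errors) / len(errors), max(errors)
```

For made-up miss ratios `model = [1.1, 1.9]` and `sim = [1.0, 2.0]`, the relative errors are about 0.10 and 0.05, so the average is about 0.075 and the maximum about 0.10.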

Since our approach essentially does a mental simulation 
of algorithms, why are the differences not zero? There 
are several reasons for the mismatch between simulations 
and analysis. In algorithms where the cache miss ratio is
data dependent (quicksort and image labeling), the errors are
relatively larger than the others because of the lack of fidelity
in the assumptions about data distributions. In the other
algorithms, the differences are much smaller; we believe 
these can be attributed to our incomplete knowledge of the 
specific implementations of the algorithms. 

5 Predicting The Cache Miss Ratio 

Armed with the knowledge that the data reference modeling
method is acceptably accurate in predicting cache miss ratios
for several system configurations, we can now apply the

Figure 7: Cache miss ratio for shuffling FFT, B=4 and P=16.

Figure 8: Cache miss ratios for quicksort.

One of the most interesting features of these graphs is
the relationship between cache miss ratio and the problem
size for a given cache line size. Figure 8(b), corresponding
to B = 1, shows that when the problem size increases the
cache miss ratio also increases. Figure 8(c), corresponding
to B = 4, shows that the cache miss ratio is largely insensitive
to the problem size. Figure 8(d), corresponding to B = 16,
shows that when the problem size increases the cache miss
ratio decreases.

The reason for this behavior is that a larger cache line size
(greater than four) reduces the cache miss ratio significantly
during the early stages of algorithm execution, when the
first-time miss cost is most significant. On the other hand,
a larger cache line size pays a higher cache miss penalty
toward the end of the execution, when coherence related
invalidations abound. Notice that for the B=16 case, the flat
miss ratio curves indicate that the cache miss ratio is less
sensitive to the number of processors than when the cache
line size is small. These results suggest that a cache line
size greater than or equal to four is preferable to smaller line
sizes.

6 Conclusions

This paper presented a data reference modeling approach to
computing the cache miss ratios of multiprocessors for
different algorithms. The method involves a mental simulation
of the algorithm. The approach is validated by comparing
analytically obtained results with those from simulations.

The modeling approach analyzes algorithms to derive a
few parameters that capture the processor locality in
applications. These parameters are expressed as a function of the
problem size, the number of processors, and the cache line
size. By analyzing the algorithm, we can accurately capture
the impact of the system configuration and problem size on
the cache miss ratio. This approach is different from methods
developed by others in that it does not analyze an address
trace. We found that using parameters measured from address
traces alone allows us to study application behavior only
under a system environment similar to the one from which
the traces were derived. Furthermore, even when the application
behavior is similar, we have found it to be virtually impossible
to predict the miss rate as various system parameters are changed
without considering algorithm characteristics.

After showing that the data reference modeling approach
is accurate and is not too onerous to use, we used the model
to predict the cache miss ratios for different system
configurations.

7 Acknowledgments